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Abstract 

Background: Structural variations in human genomes, such as deletions, play an important role in cancer 
development. Next-Generation Sequencing technologies have been central in providing ways to detect such 
variations. Methods like paired-end mapping allow to simultaneously analyze data from several samples in order to, 
e.g., distinguish tumor from patient specific variations. However, it has been shown that, especially in this setting, 
there is a need to explicitly take overlapping deletions into consideration. Existing tools have only minor 
capabilities to call overlapping deletions, unable to unravel complex signals to obtain consistent predictions. 

Result: We present a first approach specifically designed to cluster short-read paired-end data into possibly 
overlapping deletion predictions. The method does not make any assumptions on the composition of the data, 
such as the number of samples, heterogeneity, polyploidy, etc. Taking paired ends mapped to a reference genome 
as input, it iteratively merges mappings to clusters based on a similarity score that takes both the putative location 
and size of a deletion into account. 

Conclusion: We demonstrate that agglomerative clustering is suitable to predict deletions. Analyzing real data from 
three samples of a cancer patient, we found putatively overlapping deletions and observed that, as a side-effect, 
erroneous mappings are mostly identified as singleton clusters. An evaluation on simulated data shows, compared to 
other methods which can output overlapping clusters, high accuracy in separating overlapping from single deletions. 



Introduction 

It is well known that mutations in the human genome are 
associated to diseases such as cancer. Besides small scale 
mutations like single nucleotide variants, larger events 
such as deletions, insertions, inversions, or inter-chromo- 
somal rearrangements can have a crucial impact on the 
initiation and development of cancer. The detection and 
analysis of these structural variations play an important 
role in understanding the underlying mechanisms of can- 
cer, its diagnosis and treatment [1-4]. 

Current sequencing technologies allow to obtain high 
data volumes at low cost. It has now become affordable 
to sequence several samples of the same patient, enabling 
comparative analyses of, e.g., tumor cells versus healthy 
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blood cells, or samples taken before versus after treat- 
ment, to distinguish tumor from patient specific varia- 
tions, or to observe structural variations over time [5,6]. 

In the analysis of such complex data, it is important to 
consider heterogeneity of various kinds [7]. Besides the 
differences between several tissues or time points, in can- 
cer one always has to face heterozygosity (mutations only 
affecting one allele), loss of heterozygosity (mutation inac- 
tivating the second allele), aneploidy (different copy num- 
bers for some chromosomes), copy number alterations 
(different copy numbers for parts of chromosomes), differ- 
entiation of tumor cell lines developing different varia- 
tions, etc. An additional challenge is that a cancer sample 
is most likely a mixed sample, i.e., although taken from 
tumor tissue, it usually contains also normal cells [2,8,9] . 

For the detection of single nucleotide variants (SNVs), 
there exist several approaches, some of which address 
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the above issues. For instance, SomaticSniper [10], Join- 
tSNVMix [11] and MutationSeq [12] call somatic SNVs 
from pairs of normal and tumor samples, where the first 
two methods follow a Bayesian approach to distinguish 
tumor from patient specific SNVs, and the latter builds 
on clustering by support vector machines. Strelka [13] 
explicitly models mixtures of tumor and normal cells 
and can also call small indels. Also, several tools exist to 
accurately detect SNVs in pooled data [14-17], even 
mutations of low abundance. Apart from analyzing sin- 
gle SNVs, also haplotype inference and assembly has 
been addressed [18-20]. For the analysis of gene expres- 
sion data, also Bayesian approaches have been proposed, 
even considering subtypes of cancer [21]. 

In contrast to the analysis of SNVs, for the detection of 
somatic deletions and other larger structural variations, 
one usually has to process the different samples separately 
and to compare the results afterwards, e.g., subtract dele- 
tions found in the healthy sample from those found in the 
tumor sample. Or one can pool the data and afterwards 
only select those calls solely based on tumor data [22]. 
Only recently, joint analysis of several data sets have been 
proposed [5,6]. 

As shown in a preliminary study [5] to detect deletions 
by a combined analysis of samples from tumor and healthy 
tissue, there were regions in the tumor genome for which 
existing tools predicted more deletions than there could 
actually be on a diploid genome. When instead two diploid 
sets of chromosomes were assumed, i.e., the tumor sample 
is actually a mixture of cancerous and healthy cells, almost 
all data could be explained consistendy, by explicitly mod- 
eling heterozygous deletions on different alleles to be over- 
lapping. These observations have been made particularly 
in regions where deletions were found in the healthy cells 
and additional deletions have been predicted in the tumor 
sample - thus in regions where it is especially difficult to 
distinguish cancer from patient specific mutations. How- 
ever, the scope of the presented method has been to show 
that a consistent scenario of overlapping deletions can be 
found. It greedily constructs a "possible" solution, without 
raising the claim of reporting a "reasonable" result. 
Furthermore, the model is restricted to the very specific 
case of analyzing a mixture of cancer and normal cells, 
including some additional technical, combinatorial 
assumptions. 

In this paper, we present a method to detect deletions 
that is particularly designed to handle overlapping dele- 
tions. For the sake of flexibility, no assumption on the 
composition of the data is made - whether it is just from 
one sample or pooled data, it contains different cell lines, 
aneploid cells, etc. Being aware of such heterozygosity and 
specially designed for such data, we refrain from predict- 
ing a deletion being "heterozygous or homozygous", or 
"tumor or patient specific". Instead, besides a tabular 



listing of the results, a rich visual output for each set of 
overlapping deletions is provided, allowing an easy inspec- 
tion of the results. 

The method takes as input paired ends that have been 
mapped to a reference genome and collects those map- 
pings likely originating from the same deletion in clusters. 
Agglomerative clustering is utilized to cluster mappings by 
similarity. The similarity score is based on a three-dimen- 
sional representation of mappings and deletions similar to 
the two-dimensional representation introduced by Dew et 
al. [23] and used for structural variation detection by Sindi 
et al. [22] in the tool GASV. With this representation, we 
obtain a score that captures both the putative location and 
size of the deletion. 

We applied our method to a data set from several sam- 
ples taken from an acute lymphoblastic leukemia patient. 
Besides examples for predicted overlapping deletions, we 
find that single, putatively erroneous mappings not 
assigned to another cluster can nicely be identified as 
outliers. Since overlapping deletions are rare events and 
their verification is difficult, we performed an evaluation 
on simulated data showing good accuracy. 

After providing the necessary background in the fol- 
lowing section, we introduce our approach in Section 
"Method". In Section "Results and discussion", we pre- 
sent a simulation-based evaluation and results on real 
data, before we conclude our study. 

Background 

We will now give a brief overview of existing approaches 
to identify structural variations and then introduce the 
technique of paired-end mapping, which our method is 
based on. 

Structural variation detection 

Besides different experimental techniques, there are 
many computational approaches for structural variation 
detection [24]. A straight forward idea to detect muta- 
tions would be to fully assemble the genome under con- 
sideration, the so-called donor genome, and to align it to 
a reference sequence. To save the time and cost intensive 
assembly and finishing steps which would be necessary to 
determine the full genome sequence, one usually follows 
other approaches. Instead of performing a full assembly, 
one can restrict the process to only reads from regions 
suspect to harbor a variation, as for instance done in 
[25]. The tool fermi [26] allows both full assembly and a 
pre-filtering for reads unique to one sample. Other recent 
methods simultaneously assemble several genomes into a 
single graph data structure allowing for variation calling 
[27,28]. 

Another, more common approach is to omit any 
assembly of the donor genome and instead utilize the 
reads directly to detect differences by mapping them to 
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the finished reference sequence. Basically, there are three 
classes of methods to identify structural variations from 
those mappings. (See [24] or [29] for reviews.) (1) Signifi- 
cant fluctuations of the coverage of the reference by map- 
pings can indicate copy number changes. If part of a 
chromosome is lost or duplicated, the coverage drops or 
increases, respectively. (2) If a read has not been mapped 
completely, but in parts, this can indicate different types 
of mutations. E.g., if one half of the split read is mapped 
with some space to the remaining part, the segment in- 
between might be absent in the donor genome. If parts 
are mapped to different chromosomes, this can indicate 
inter-chromosomal rearrangements. (3) If the donor gen- 
ome has been sequenced using a certain technology, 
pairs of reads (paired ends) are obtained. The orientation 
and the distance of the reads within a pair are known in 
the donor, and when the corresponding mappings on the 
reference do not agree with this pattern, structural varia- 
tions can be called similar to the split read approach. 
This paired-end mapping technique, introduced by 
Korbel et al. [30], will be explained in detail in the follow- 
ing section. 

Since all techniques have their advantages and disadvan- 
tages, in general, a combination of different techniques is 
advisable. Different tools can be applied separately and the 
results are combined in a post-processing step. Also tools 
exist which combine techniques already during the detec- 
tion, e.g. inGAPsv [31], CNVer [32], GASVPro [33], 
SVseq2 [34], or the method by Nord et al. [35]. The 
method presented here is based on paired-end mapping. 
It is particularly designed for regions harboring several 
overlapping deletions. Due to the complexity of such 
regions, a partial assembly or coverage analysis is hardly 
possible. However, integrating a split read approach is 
planned for future work. 

Deletion detection by paired-end mapping 

When genomic material from a sample is sequenced 
by short-read sequencing, many overlapping DNA- 
fragments are produced and a certain number of bases 
are read from both ends of each fragment, resulting in a 
pair of reads, so-called paired ends. The reading direction 
and the length of the reads are known. Further, since the 
fragment size is fixed - in practice distributed around the 
desired length - their approximate distance is known as well. 

The paired ends from a newly sequenced donor genome 
can be mapped to a reference genome which is already 
assembled to a complete DNA-sequence. In a region 
where the two genomes do not differ, the mapped reads 
have the original direction and their distance coincides 
with the fragment length. Such a mapping is called concor- 
dant. If however a mapping is discordant, i.e., either the 
orientation is inconsistent or the distance differs 



significantly from the expected fragment size, this indicates 
a structural variation in the donor w.r.t. the reference. 

In this paper, we only focus on deletions, i.e., a segment 
present in the reference is not present in - we say deleted 
from - the donor genome. Hence, paired ends spanning 
the deletion breakpoint in the donor genome will be 
mapped to the reference with a distance increased by the 
size of the deletion but with proper orientation. We call 
such a mapping stretched. 

Since the fragment length is only known approxi- 
mately, the size of a deletion cannot be determined 
exactly by paired-end mapping. It cannot even be exactly 
decided whether a mapping originates just from a very 
long fragment or is really discordant due to a deletion. 
One approach is to fix a minimum and maximum 
expected mapping distance [D min and D max ), e.g., mean 
distance plus/minus three times the standard deviation. 
Then, a mapping with distance d is considered being 
stretched if d >D max , and the size of the putative deletion 
is expected to be between d - D max and d - D min . Instead 
of this discretization, other methods [36,37], such as 
ours, keep and make use of the information that the frag- 
ment length and thus the mapping distances can usually 
be approximated by a normal distribution with some 
mean ^. (Figure 1 shows a histogram of the fragment 
length distribution in the data set discussed in Section 
"Real data".) The expected deletion size is then also mod- 
eled to be normally distributed around d - p. 

Based on one of the above or a similar model, many 
tools, e.g. [5,6,22,33,37,38], assign all given mappings to 
clusters, where each cluster corresponds to one deletion 
that is supported by all mappings in the cluster. Figure 2 
shows an example for such a deletion cluster. 

Method 

In a region that possibly harbors several deletions, we 
obtain paired ends from the different alleles which are 
then all mapped to the same reference sequence. Depend- 
ing on how similar the deletions are in terms of location 
and size, it is difficult to separate the mappings and to 
recover the different deletions. Our goal is to partition the 
set of mappings into clusters of similar mappings probably 
belonging to the same deletion. To this end, we utilize the 
technique of agglomerative clustering, which in general 
works as follows. At the beginning, each object (in our 
case stretched mapping) is a singleton cluster, and a simi- 
larity score is computed for all pairs of clusters. Then, 
iteratively, a pair of clusters of maximum similarity is 
merged, the two original clusters are replaced by this new 
cluster, and the similarity between the new cluster and all 
others is recomputed. These merging steps are repeated 
until either only one cluster is left, or the maximum simi- 
larity is below a certain threshold. 
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Figure 1 Length distribution of the paired-end sequenced fragments. Distribution from chromosome 21 of the sample "before treatment", 
discussed in Section "Real data". The length values have been extracted from the mapped paired-ends, where only unambiguously and exactly 
aligned pairs have been considered. Strictly speaking, also discordant mappings (those, spanning insertions or deletions) are contained. But due 
to their low abundance, they do not affect the distribution Significantly. 



In the following, we will first explain how we model 
deletion clusters, and then introduce our similarity score, 
which is crucial for accurate clustering. Finally, we give 
some details about our implementation. 



Deletion clusters 

The similarity score used in our clustering approach fol- 
lows an idea introduced by Dew et al. [23] for the analysis 
of mate pairs in assemblies and used for structural 
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(a) Fixed deletion boundaries (b) Probabilistic deletion coordinates 

Figure 2 Deletion detection by paired-end mapping. Paired ends obtained from the donor genome (top) are mapped to the reference 
genome (bottom). Since the paired ends span a deletion breakpoint in the donor, the mappings are stretched. The deleted part in the 
reference genome is shown in gray. 
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variation detection by Sindi et al. [22] . Based on the discre- 
tized approach explained in "Background" (minimum and 
maximum concordant mapping distance D min and D max ), 
mappings and the implied location and size of the deletion 
are represented as a trapezoid in the two-dimensional 
space: The right end of the left read defines the left bound- 
ary of the start coordinate of the deletion (on the x-axis), 
and the left end of the right read the right boundary of the 
end coordinate (on y-axis). Between these, the minimum 
and maximum expected deletion size d - D max and 
d - D min span an area of "allowed" deletion coordinates. 
See Figure 3(a) for an example. If the trapezoids of several 
mappings overlap, the intersecting area (again a trapezoid) 
contains exactly the coordinates of all deletions supported 
by the mappings. The tool GASV [22] efficiently computes 
all maximal sets of pairwise intersecting trapezoids. Either 
all these clusters are output (option "maximal") or overlap- 
ping clusters are combined to single deletion calls (default). 
Inversions are detected by GASV in a similar fashion. 

For our similarity score, we build on this two-dimen- 
sional representation of deletion coordinates, as it nicely 
incorporates both the location and the size of a deletion. 
Again, we use the x- and y-axis for the deletion start and 
end coordinates, respectively. But to model the deletion 
size, instead of the sharply bounded interval, we use a 



continuous distribution. As mentioned before, assuming 
that the fragment length follows a normal distribution 
with mean y, and standard deviation a, the expected dele- 
tion size implied by a mapping with distance d is nor- 
mally distributed around d - fi with the same standard 
deviation a. In the geometric interpretation, for each pos- 
sible start coordinate for a deletion on the x-axis, this dis- 
tribution describes the expected end coordinate on the 
y axis. Since the deletion has to be located in-between 
the mappings of the paired ends, these, together with the 
main diagonal, define a triangle of "allowed" deletion 
coordinates, which is covered in the third dimension by 
the normal distribution as shown in Figure 3(b). In this 
diagram and all following, the probability will be repre- 
sented by shaded gray toning. 

In the clustering process, we characterize each map- 
ping and also each cluster composed of several mappings 
by the above mentioned parameters: The left and right 
boundaries / and r for the deletion coordinates, and ji 
and a of the normal distribution. We further store the 
number n of mappings in a cluster. For a singleton clus- 
ter, the parameters are directly defined by the mapping's 
characteristics, as described above. When we merge two 
mappings or two clusters Ci and Ci with parameters /„ 
r h n b a, and n b where i = 1 or 2, respectively, we combine 
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(a) Fixed deletion boundaries (b) Probabilistic deletion coordinates 

Figure 3 Geometric interpretation of stretched mappings. In both figures, the deletion start is represented on the x-axis, and the deletion 
end on the y-axis. (a) The coordinates for three possible deletions A, B, and C are shown exemplarily. Assuming that deletion A corresponds to 
the left-most and smallest possible deletion, B to the right-most, and C to the left-most and largest possible deletion, these span a trapezoid 
(shown in gray) which defines the coordinates of all allowed deletions, as introduced by Sindi et al. [22]. (b) Instead of assuming a smallest and 
largest deletion size, a probability for the deletion size and thus for the deletion coordinates is considered, represented by the shaded gray. 
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the parameters to obtain a new cluster C lM as follows. 

"iu2 := «i + n 2 

hu2 :=max{! 1; l 2 ) 

r lu2 :=mm{r 1 ,r 2 } 
tiifii + n 2 fi 2 



/^iu2 := 



'1U2 



"1U2 



"1U2 



With our choice of / and r as the maximum and mini- 
mum, respectively, we follow an "average link" strategy as 
it corresponds to taking the intersection of the triangles. 
This is the most intuitive way since the new cluster should 
represent the "area" of deletion coordinates to those which 
are compatible to both of the original clusters. The equa- 
tion for the joint standard deviation is borrowed from 
population-based statistics for aggregating non-overlap- 
ping sub-populations. If the two clusters agree in the loca- 
tion of the deletion, i.e., fCi - fi 2 - 0, the resulting a 2 u2 is 
the mean of the two standard deviations. The more the 
two clusters disagree, the broader becomes the distribu- 
tion. For reasons we will explain later, this way of joining 
the deviations turned out to be better suitable for our 
similarity score and the resulting performance in cluster- 
ing than using the actual joint variance. 

Similarity score 

For the clustering, we need to define a score function that, 
for a given pair of clusters, defines a value - the more simi- 
lar the clusters are, the higher this score. Our similarity 
score is defined in the range from zero to one. In our case, 
the similarity of clusters, i.e. deletion predictions, depends 
on two factors: The location and the size of the deletion, 
both included in our geometric cluster model. Recall that 
each cluster C defines a volume, say V(C), whose base is 
given by a triangle with a Gaussian shaped "mountain" on 
top. The larger the overlap of two triangles, the more they 
agree in the predicted location of the deletion. The closer 
the ridge of their mountains, the more they agree in the 
deletion size. Based on this observation, we define the 
score as the normalized intersection of the two volumes. 

sim(C 1 ,C 2 ) := — — 

max{V(C 1 ),V(C 2 )} 

This score has the following properties. 

♦ The score is zero if and only if the triangles do not 
overlap, which means that there is no location for a 
deletion compatible with both clusters. 

♦ The score is one if and only if the two clusters are 
equal, because, only if all parameters of the two clus- 
ters are equal, the intersection volume equals the 
maximum. 



♦ It is more sensitive to differences in the deletion size 
than in the location. The reason for this is that a shift 
parallel to the ridge of the volume does not affect the 
intersection volume to such an extent as a shift per- 
pendicular to it. This is a particularly desired behavior 
since we expect a deletion breakpoint being covered 
by several mappings. In some mappings the break- 
point lies more to the left and in others more to the 
right, which - even if all mappings are correct - corre- 
sponds to staggered triangles. Such triangles, even if 
staggered, would have to be clustered and should 
thus not be punished too hard by a low score. In con- 
trast, a difference in the deletion size indicates a dis- 
agreement of the mappings. 

We now come back to the issue how to combine the 
standard deviations of several mappings. As already 
mentioned, we did not choose the more intuitive combi- 
nation by computing the standard deviation of the 
means, i.e. c>l u2 := a 2 /n\u 2 , where (J is the standard 
deviation of the fragment length. This results in nar- 
rower distributions for clusters containing more map- 
pings. On the one hand, this would make sense since 
the prediction is more accurate, the more mappings 
support it. On the other hand, besides computational 
problems due to score values close to zero, this also 
yields an artifact: Clusters containing only a few map- 
pings have a broader volume and are thus more likely 
to be clustered than larger clusters with a narrow 
volume. The order in which clusters are aggregated 
would be dominated by their cardinality, rather than by 
their similarity. To avoid this, we chose our definition of 
c>i U2 to be independent of the cardinality of the clusters. 

It remains to describe the stopping criterion for the 
clustering process. Since the normal distribution is never 
exactly zero, any two clusters that overlap in their base 
(triangle) have a non-empty intersection volume, no mat- 
ter how weak the overlap is. We thus set a minimum 
threshold S min for the similarity score. If no pair of clus- 
ters has similarity larger than S min , the clustering process 
is stopped. To avoid the introduction of a parameter 
which is arbitrarily fixed or has to be given by the user, 
we examine the data in a preprocessing step to determine 
a threshold that separates Significant similarity values 
from noise. For each mapping, we determine the smallest 
non-zero similarity score to any other (overlapping) map- 
ping. Setting S min to the median of these minima showed 
robust accurate behavior in practice, also approved in 
our experiment as explained at the end of "Simulation of 
overlapping deletions". 

Implementation 

In general, the run time complexity of agglomerative 
clustering depends on the similarity score. Using priority 
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queues to store the scores, 0(N 2 log N) pairwise score 
computations have to be performed to cluster N objects. 
Our score can be computed as follows. In each triangle, 
the height of the volume is constant on a line parallel to 
the main diagonal and thus parallel to the hypotenuse 
of the triangle. This allows us to compute the volume 
by traversing a triangle (or the intersection of two trian- 
gles, which is again a triangle) starting from its hypote- 
nuse towards the upper left corner, and summing up 
the product of the length of the line and the height of 
the volume given by the normal distribution (or the 
minimum of two, respectively). Since this takes O (r - 1) 
time and the score computation consists of a constant 
number of volume computations, the overall run time 
complexity of the agglomerative clustering is in O (LN 2 
log A/), where L is the maximum length of a mapping. 

In practice, the run time is dominated by reading the 
input and thus approximately linear in the number of 
mappings. 

We implemented the method in JAVA. The input con- 
sists of one or several BAM files and a simple tabular 
separated file listing the mean segment length and stan- 
dard deviation for each file. Additionally, a color can be 
specified for each BAM file, which is used to visualize the 
mappings in the graphical output. The mappings are read 
from the BAM files using SAMtools [39] and applying 
several filters. Only mappings of high quality (quality 
value at least 20), without gaps, uncapped and without 
co-optimal mapping locations are used. Further, the 
mappings can be filtered by their length. Here, a mini- 
mum length of mean plus three times standard deviation 
showed good performance in practice. 

Before the actual clustering, we partition the mappings 
into so-called regions, maximal subsets such that no map- 
ping from one set overlaps with any mapping in another 
subset. These regions are then clustered independently. 

The output is composed of two parts. (1) A tabular sepa- 
rated file is generated, listing all detected clusters includ- 
ing, besides other details, the breakpoint regions, the mean 
expected deletion size and the standard deviation, and 
number of mappings from each BAM file. The latter infor- 
mation can for instance be used in a post-processing step 
to determine whether a deletion is patient specific: If sev- 
eral samples have been analyzed, for instance "healthy" 
and "tumor", a cluster with a high number of mappings 
from the tumor set but a small number of mappings from 
the healthy set indicates a tumor-specific deletion. Addi- 
tional file 1 lists the mappings for each cluster. (2) A gra- 
phical output of each region is provided in form of 
R-code, which produces PDF graphics as can be seen in 
Figure 4. The triangle for each mapping is shown in the 
color assigned to the corresponding BAM file, and the 
third dimension is indicated by shaded gray tones. Clusters 
with a sufficient number of mappings (threshold chosen 



by the user, default 2) are highlighted by a yellow trape- 
zoid (spanned by /, r, ft-3o, and ^+3<t) labeled with the 
number of mappings. Smaller clusters are not shown and 
their mappings are depicted by dotted lines. The R-code 
can easily be modified to produce customized scalable vec- 
tor graphics. 

The tool, including the source code and example data, is 
available from the Bielefeld University Bioinformatics 
Server: http://bibiserv.cebitec.uni-bielefeld.de/agglodel/ 

Results and discussion 

Before we present results on real data from three sam- 
ples of a cancer patient, we first investigate the accuracy 
of the new clustering method in distinguishing overlap- 
ping from single deletions in simulated data sets. 

Simulation of overlapping deletions 

Since, to the best of our knowledge, no other method 
exists that aims at identifying overlapping deletions, and 
because such instances are difficult to detect or verify in 
the wet lab, there is no gold standard available we could 
use for an evaluation. Instead, we created a simulated data 
set which is based on previously detected single deletions. 

We took chromosome one of the human genome 
(hgl9, GRCh37) and created two diploid copies of it into 
which we introduced deletions. From a list of variations 
downloaded from the Database of Genomic Variants 
[[40], chrl of indel.hgl9.vl0.nov.2010.txt], we first 
sampled 1,000 non-overlapping deletions. For 500 of 
them, we simulated an overlapping deletion, the size of 
which was sampled from the afore mentioned list. 
Paired-end mapping approaches are not suited to detect 
small deletions. Thus, on the one hand, we only sampled 
deletions of a certain minimum length. On the other 
hand, we could not set this threshold too high to be able 
to sample enough non-overlapping deletions in the first 
run. A minimum length of 105 turned out to be a good 
tradeoff. The length distribution is shown in Figure 5. 

From these artificial chromosome copies, we sampled 
Illumina paired ends using the tool SimSeq [41]. We 
selected a typical read length of 100, and to be comparable 
to the data set we analyze in Section "Real data", we chose 
a mean fragment lengths of 300 with standard deviation 
35. We performed several runs with different coverage for 
the different alleles to simulate the two scenarios shown in 
Figure 6: (A) The overlapping deletions are both homozy- 
gous, and (B) one deletion is homozygous and the other is 
heterozygous. Both settings were simulated using 20x and 
60 x sequence coverage, the first of which roughly corre- 
sponds to the real data set (Section "Real data"). We 
mapped the obtained reads back to the original hgl9 
sequence with the mapping software BWA [42]. 

We used our agglomerative clustering approach with 
the filtering of mappings as explained in "Method", and 
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Deletion start 

Figure 4 Region on chromosome 19 possibly harboring two overlapping deletions The mappings of the different data sets (before 
treatment, after treatment, and relapse) are represented as red, blue and green triangles, respectively, where the third dimension is indicated by 
shaded gray tones. The detected deletions are shown as yellow trapezoids. A singleton cluster, which is probably an erroneous mapping, is 
depicted with dotted lines. 
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Figure 5 Distribution of the deletion size used for sampling deletions in the simulation-based evaluation 
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500 times 



500 times 



and 500 times 



and 500 times 



(a) Scenario A (b) Scenario B 

Figure 6 Two scenarios used for sampling deletions in the simulation-based evaluation. Two diploid chromosomes with 500 non- 
overlapping deletions and 500 pairs of overlapping deletions (both homozygous, and homozygous and heterozygous) have been simulated. 



also ran two other tools on the same set of filtered map- 
pings. First of all, we included GASV [22] (release 2.01) 
into our evaluation, as our similarity score is based on an 
extension of the geometric interpretation used there. We 
set the minimum cluster cardinality to 2 and used both, 
the option to output all maximal sets of overlapping trape- 
zoids (see "Method") which we will refer to as GASV_max, 
and the default to output merged clusters, referred to as 
GASV. Although a follow-up version GASVPro [33] has 
been published recently, the available software does not 
yet include all necessary preprocessing tools to process 
BAM files. Secondly, we also ran CLEVER [37] as one of 
the most recent and accurate variation detection tools. 
This method computes a probability for mappings arising 
from the same deletion, and builds a graph with the map- 
pings as vertices and edges for pairs with Significantly high 
probability. Clusters are determined as maximal cliques in 
this graph, which in general allows overlapping deletions 
in the output. Clusters are determined as maximal cliques 
in this graph. 

The results are detailed in the table in the appendix. 
In Figure 7, we summarize the results in form of ROC 
plots showing the accuracy at identifying non-overlap- 
ping deletions and pairs of overlapping deletions in 
terms of true and false positive rates: 



#correctly predicted single del's 
#simulated single del s 



TPRi = 

TPR - - #correctl y predicted - pairs 
2 ~ P 2 ~ #simulated pairs 



FPR, . 



#falsely predicted single del's 
flsim. pairs + #no del. detected as del. 



FPR - ££l - #falsely predicted pairs 

tl Kl ~ N 2 ~ #sim. single del s + #no del. det. as del. 



On average, CLEVER detected 92% of all deletions and 
made only about 1% wrong predictions. Even though max- 
imal cliques output by CLEVER can be overlapping and 
the output can thus contain overlapping deletion predic- 
tions, this has to be understood as a technical consequence 
of the clustering procedure as CLEVER does not aim at 
predicting overlapping clusters. Nevertheless, we tried to 
distinguish between non-overlapping and overlapping 
deletions by simply considering non-overlapping cliques 
and pairs of overlapping cliques. The true and false posi- 
tive rates were below 0.3 for all scenarios (not shown in 
the figure). 



In all settings, agglomerative clustering with the pro- 
posed similarity score proves to be accurate. GASV per- 
forms well, where using the option to output all clusters 
before merging turned out to be slightly more effective in 
this setting. 

As can be seen in the Venn diagram in Figure 8, overall, 
625 pairs of overlapping deletions have been correctly 
identified by all three approaches, 476 were exclusively 
found by our approach, none by GASV, and 190 by 
GASV_max. Figure 9(a) shows a typical example in which 
agglomerative clustering correctly unraveled two overlap- 
ping deletions, whereas GASV (both variants) makes one 
prediction that is rather vague. Figure 9(b) exemplifies that 
in some instances, the two overlapping deletions were so 
similar in location and size that they appear to be indistin- 
guishable. The mean distance (all scenarios) of simulated 
overlapping deletions falsely detected as one deletion was 
about 20% smaller than those correctly detected as two 
deletions; the ratio for the deletion sizes was about the 
same. Further, the resolution was worse for smaller dele- 
tions: Falsely classified deletions were on average 13% 
smaller than correctly classified ones. 

We also investigated, when two overlapping deletions 
have been called correctly, how accurate the mappings ori- 
ginating from the two different alleles have been assigned 
to the two clusters. Ideally, one clusters should only con- 
tain mappings from the first deletion and the other cluster 
only those from the second deletion. On those pairs, cor- 
rectly identified by all three methods, GASV clusters con- 
tained 0.041% mis-assigned mappings, GASV_max 
0.019%, and our clusters contained 0.006%. In total, i.e., 
including the more difficult cases GASV did not detect as 
pairs at all, we observed 1.3% mis-assignments in our clus- 
ters and 4.0% in the GASV_max clusters (averaged over all 
four simulation settings). 

To analyze the similarity threshold used as a stopping 
criterion in our clustering, we performed several runs with 
varying thresholds and summarized the performance in a 
ROC curve, shown in Figure 10. In this simulation setting, 
the threshold that is computed from the data as described 
in Section "Method" turns out to be close to optimal in 
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FPR 

(a) Coverage 20, scenario A 



FPR 

(b) Coverage 20, scenario B 




FPR FPR 

(c) Coverage 60, scenario A (d) Coverage 60, scenario B 

Figure 7 Accuracy of different methods in detecting single or overlapping deletions. For different coverages (20x and 60x), and two 
different simulated scenarios (see Figure 6), the accuracy in distinguishing single from overlapping deletions has been measured. The empty 
symbols represent the true and false positive rate for detecting single deletions (TPR,, FPR,). The solid symbols represent the true and false 
positive rate for detecting pairs of overlapping deletions (TPR 2 , FPR 2 ). 



terms of accurately detecting non-overlapping deletions 
and even optimal for pairs of overlapping deletions. 

Real data 

The Department of Paediatric Oncology, Haematology 
and Immunology at the Diisseldorf University Hospital, 
Germany, provided sequencing data of an acute lympho- 
blastic leukemia patient. Three samples (before treatment, 
after treatment, and relapse) have been sequenced on an 



Illumina HiSeq 2000 with read length 51, segment length 
around 300, and sequence coverage of 6x, 8x and 8x, 
respectively. At the Institute of Medical Informatics at 
University of Munster, Germany, the reads have been 
mapped to hgl9 with BWA [42] (version 0.5.9, at most 3 
mismatches and standard parameters otherwise) and 
Picard [43] has been used to remove duplicates. Figure 1 
exemplarily shows the fragment length distribution for 
chromosome 21 from one of the data sets estimated from 
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GASV - 661 

Figure 8 Identification of overlapping deletions. This Venn diagram shows how many pairs of overlapping deletions have been correctly 
identified by the different tools an all four simulation settings. Note that the areas are not scaled exactly proportional. 
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Deletion start Deletion start 

(a) Deletions separated by agglomerative clustering. (b) Indistinguishable deletions. 

Figure 9 Two instances of the evaluation. The green and red triangles (shown partially) represent the mappings of the two alleles, and the 
third dimension is represented in shaded gray tones. The true deletion coordinates are indicated by white crosses, deletions obtained by 
agglomerative clustering are depicted in yellow, and the blue line shows the breakpoint region predicted by GASV. 
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Figure 10 Accuracy depending on the similarity score used as threshold in the agglomerative clustering Thresholds 10 s , 10 7 , . . ., 10 3 , 

0.002, 0.01, 0.05, and 0.1 have been tested for scenario B (see Figure 6(b)) and coverage 60x. The default threshold as computed from the data 
was approximately 0.002, marked with a * The minimum and maximum thresholds are marked with a + and x, respectively. The true and false 
positive rates for non-overlapping and pairs of overlapping deletions are shown. 



the mapping distances. It approves that the assumption of 
a normal distribution is approximately met. 

Processing the data on a Sun Fire X4600 M2 with dual- 
core AMD processors at 2.6 GHz and 32 GB RAM took 
in total about 8:46 CPU hours, from which 7:56 hours 
were needed to read and filter the 67 GB of input, and 
only 50 minutes were necessary to cluster the filtered 
3,249,648 mappings, and generating the output. We con- 
sidered all mappings with distance at least three standard 
deviations larger than the mean. Those have been filtered 
and partitioned into 731,522 regions as described in 
"Method", only 288,552 of which contained at least two 
mappings. In total, 101,910 deletions supported by at 
least two mappings have been detected, comprising 104 
pairs of overlapping deletions. Table 1 gives an overview 
of the number of deletions, and the amount of overlaps 
found. 

Many highly overlapping deletions were found in telo- 
meric or centromeric regions, where mappings are not 
very reliable. Figure 4 exemplarily shows a region where 
we found overlapping clusters. The shown region is also 



an example for a general observation we made: Agglom- 
erative clustering with our similarity score nicely identifies 
outliers among the discordant mappings, as those remain 
as singleton clusters. In our approach, such putatively 
erroneous mappings do not distort the main deletion pre- 
diction, whereas, for instance in GASV, a single mapping 
can drastically affect a cluster if the corresponding trape- 
zoid overlaps it, or can result in many maximal sets of 
overlapping trapezoids. 

An experimental validation of this and further findings 
is currently being performed. Here, we wanted to give an 
overview of the general performance of the presented 
method on a real world instance. In cooperation with the 
Diisseldorf University Hospital, further analyses of this 
and other data sets are planned. 

Conclusions 

It is well known that structural variations in the human 
genome play an important role in the development of dis- 
eases, especially cancer. In particular, the accurate detection 
of deletions still remains a challenging task. A preliminary 
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Table 1 Number of overlapping deletions. 



Size (# deletions) 


0 


1 


2 


3 


4 


5 


6 


7 


8 


9 


> 10 


Total 


# regions 


631,769 


97,869 


1,768 


68 


25 


8 


2 


2 


3 


4 


4 


731,522 


# components 




101,615 


104 


9 


4 


3 


1 


2 


0 


1 


0 


101,739 



If the breakpoint region of a deletion, i.e., the interval defined by / and r of the cluster, overlaps another breakpoint region, these deletions are assigned to the 
same component. That means, non-overlapping deletions build components of size one, pairs of overlapping deletions build components of size two, etc. Note 
that deletions found in one region (cf. Section "Method") do not necessarily build one component. Below, the number of regions and components for different 
sizes are given. 



study has motivated that detection methods should be cap- 
able of handling overlapping deletions to be able to draw a 
clearer picture of variations in heterogeneous samples, e.g., 
to distinguish cancer from patient specific mutations. 

In the present study, we have demonstrated that 
agglomerative clustering is suitable for this task. We have 
introduced a similarity score that is based on a geometric 
and probabilistic interpretation of paired ends which 
have been mapped to a reference sequence. Taking into 
account both the location and the length of a deletion, 
this scoring allows to effectively cluster mappings into 
possibly overlapping clusters. The method has success- 
fully been applied on real data, and has proven to be 
accurate according to a simulation-based evaluation. 

Here, we have investigated the performance of an intui- 
tive clustering approach. On the one hand, the simplicity 
of agglomerative clustering offered us insight into the clus- 
tering process and understanding of its behavior. On the 
other hand, computationally more sophisticated clustering 
strategies could perform even better, if they are computa- 
tionally not too expensive for such large data sets. 

We believe that there is still more potential to be 
explored in the detection of structural variants, especially 
deletions. Besides the combination with other approaches, 
such as split read or coverage analysis, the increasing avail- 
ability of several samples from one patient offers a dimen- 
sion that has to be investigated. 

Note added in proof 

Another clustering based method to call deletions (and 
insertions) from paired-end mapping data from heteroge- 
neous samples has been published very recently. The tool 
SVM 2 by Chiara et al. [44] utilizes a support vector 
machine incorporating, similar to our tool, mapping loca- 
tion and mapping distance, but also coverage information. 
Including SVM 2 into our simulation-based evaluation 
could not be finished during revision of the present 
manuscript. 

Additional material 



Additional file 1: Results of the simulation-based evaluation Results 
of the simulation-based evaluation described in Section "Simulation of 
overlapping deletions" of the paper. The leftmost column specifies the 
number of simulated overlapping deletions (no deletion, single deletion, 
or pair of overlapping deletions) and the predictions (no deletion, single 



deletion, pair of overlapping deletions, or three or more overlapping 
deletions). The remainder of the table shows the corresponding counts 
per tool (agglomerative clustering, GASV [22], GASV with option 
"maximal", and CLEVER [37]) for four different settings (coverage 20x and 
60x, and scenario A and B, cf. Figure 6). 
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